Can language models preserve their own alignment?
I wouldn't consider this an argument in favour; rather, it reads as a project proposal to empirically test how much models remain "good".
Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion
They call this scalable oversight, but they both train and evaluate the lie-detector probes on labeled DolusChat examples. I don't see why they call it scalable.

Claude's analysis: https://claude.ai/share/d659b385-f625-4b86-9eb5-f8ce1fea33e5
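To make the objection concrete, here is a minimal sketch of the kind of lie-detector probe setup being criticised: a linear probe trained on labeled activations and evaluated on a held-out split of the same labeled distribution. The data here is synthetic and the separation direction is my own assumption for illustration; the point is that both training and evaluation never leave the labeled dataset.

```python
# Toy lie-detector probe: logistic regression on (synthetic) activations.
# Hypothetical stand-in for probes trained on labeled DolusChat examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 64, 500

# Assumed structure: deceptive examples are shifted along one direction.
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n)  # 1 = deceptive, 0 = honest
acts = rng.normal(size=(n, d_model)) + np.outer(labels, direction)

# Train and evaluate on splits of the SAME labeled distribution --
# which is why calling the result "scalable oversight" is questionable.
X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {probe.score(X_te, y_te):.2f}")
```

High held-out accuracy here says nothing about distribution shift; it only shows the probe fits the labeled data it was given.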
Training fails to elicit subtle reasoning in current language models
I think the Sonnet 3.5 monitor / Sonnet 3.7 attacker pairing was a narrow intelligence gap. The paper itself says "as developers continue to scale RL reasoning compute, models may become more capable of subtle reasoning." I see this as a minor empirical result with unclear generalisability.
which proxies to train against.
https://www.lesswrong.com/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in?commentId=krg2jzDxXhei9vNLj

See also Daniel Kokotajlo's comment about preserving at least one output stream that isn't optimised against (this could be the activations, while doing CoT + output monitoring).
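A toy sketch of that idea (my illustration, not the commenter's design): the training reward only sees the CoT and output monitors, while an activation-based monitor is held out purely for evaluation, so the policy feels no optimisation pressure against it. All monitor names and the episode fields are hypothetical.

```python
# Hypothetical setup: two monitored streams feed the reward; one is held out.
def cot_monitor(ep):        return ep["cot_suspicion"]
def output_monitor(ep):     return ep["output_suspicion"]
def activation_monitor(ep): return ep["activation_suspicion"]  # held out

def reward(ep):
    # Training signal: ONLY the streams we deliberately optimise against.
    return ep["task_score"] - (cot_monitor(ep) + output_monitor(ep))

episode = {"task_score": 1.0, "cot_suspicion": 0.1,
           "output_suspicion": 0.0, "activation_suspicion": 0.7}

print(reward(episode))              # what the policy is trained on
print(activation_monitor(episode))  # evaluation-only, unoptimised stream
```

Because `activation_monitor` never enters `reward`, a policy that learns to look clean to the CoT and output monitors can still trip the held-out stream.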